This project is a tutorial on using multivariate time series analysis for the stock market index NIFTY 50 from the NSE (National Stock Exchange) of India. The data, obtained from the NIFTY 50, contains the price history and trading volumes of fifty stocks in India from 2000-01-03 to 2020-09-30.
We illustrate how to use Python, R, and Stata to apply the Auto Regressive Integrated Moving Average (ARIMA) model to time series data. ARIMA can fit a given non-seasonal, non-stationary time series based on its lagged values. A general ARIMA model consists of three parts: the “AR” part means the variable of interest is regressed on its own lagged terms, the “I” part means differenced values are used, and the “MA” part means the regression error is modelled as a linear combination of past error terms. The purpose of differencing is to make the time series stationary for autoregression.
An ARIMA model is characterized by 3 terms: p (the order of the AR term), q (the order of the MA term), and d (the degree of differencing needed to make the time series stationary). Given a time series \(\{X_t\}\), an \(ARIMA(p, d, q)\) model can be expressed as: \[(1-\sum_{i=1}^p\phi_iL^i)(1-L)^dX_t= (1+\sum_{i=1}^q\theta_iL^i)\epsilon_t + \delta\] where \(\epsilon_t\) is the error term, \(L\) is the lag operator, i.e. \(LX_t = X_{t-1}, \forall t>1\), \(p\) is the number of lagged terms of \(X\), \(d\) is the number of differencing passes needed for stationarity, \(q\) is the number of lagged forecast errors used in prediction, \(\delta\) is the intercept term of the regression, and the \(\phi_i, \theta_i\)’s are the estimated regression coefficients.
ARIMA models are fitted in order to understand the data better and forecast future data. They are based on linear regression models. The best model can be chosen using AIC or BIC.
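To make the equation concrete, here is a minimal sketch (our own illustration, not part of the original analysis) that simulates an \(ARIMA(1, 1, 1)\) process directly from the recursion above, with assumed coefficients \(\phi_1 = 0.5\), \(\theta_1 = 0.3\), and \(\delta = 0\):

```python
import numpy as np

rng = np.random.default_rng(0)
phi, theta = 0.5, 0.3        # illustrative AR and MA coefficients (assumed values)
n = 500
eps = rng.normal(size=n)     # white-noise error terms epsilon_t

# Build the differenced series w_t = (1 - L) X_t from the ARMA(1, 1) recursion:
# w_t = phi * w_{t-1} + eps_t + theta * eps_{t-1}
w = np.zeros(n)
for t in range(1, n):
    w[t] = phi * w[t - 1] + eps[t] + theta * eps[t - 1]

# Integrate once (d = 1) to recover the non-stationary level series X_t
x = np.cumsum(w)
```

Differencing `x` once recovers `w` exactly, which is why a model with \(d = 1\) can be fitted directly on the non-stationary level series.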
The NIFTY 50 data consists of 50 stocks, with 230104 observations on 15 variables. The data contains daily open, close, highest, and lowest prices, volume, and other relevant information for the “Nifty Fifty” stocks since January 2000. Detailed variable descriptions are shown in Table 1 below.
| Variable Name | Variable Description |
|---|---|
| Date | Date of trade |
| Symbol | Name of the company |
| Series | We have only one series: Equity(EQ) |
| Prev Close | Previous day’s close price |
| Open | Open price of day |
| High | Highest price in day |
| Low | Lowest price in day |
| Last | Last traded price in day |
| Close | Close price of day |
| VWAP | Volume Weighted Average Price |
| Volume | A measure of sellers versus buyers of a particular stock |
| Turnover | The number of shares available for sale |
| Trades | The number of shares traded |
| Deliverable Volume | Shares which are actually transferred among demat accounts |
| %Deliverable | Percent of deliverable volume |
| Return | Return of trade |
As we are more interested in the stock prices, we mostly use the variable VWAP, which summarizes the average price of a stock on a trading day. We want to capture the trend of stock prices across the years and possibly forecast future stock prices.
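As a quick illustration of what VWAP measures (a toy example with made-up numbers, not taken from the data), it is the volume-weighted mean of trade prices over a day:

```python
import numpy as np

# Hypothetical trades in one day: prices and traded volumes
prices = np.array([100.0, 101.5, 99.8])
volumes = np.array([200, 300, 500])

# VWAP = sum(price * volume) / sum(volume)
vwap = np.sum(prices * volumes) / np.sum(volumes)
print(round(vwap, 2))  # 100.35
```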
library(dplyr)
library(ggplot2)
old_sym = c('MUNDRAPORT', 'UTIBANK', 'BAJAUTOFIN', 'BHARTI', 'HEROHONDA',
'HINDALC0', 'HINDLEVER', 'INFOSYSTCH', 'JSWSTL', 'KOTAKMAH', 'TELCO',
'TISCO', 'UNIPHOS', 'SESAGOA', 'SSLT', 'ZEETELE')
new_sym = c('ADANIPORTS', 'AXISBANK', 'BAJFINANCE', 'BHARTIARTL', 'HEROMOTOCO',
'HINDALCO', 'HINDUNILVR', 'INFY', 'JSWSTEEL', 'KOTAKBANK', 'TATAMOTORS',
'TATASTEEL', 'UPL', 'VEDL', 'VEDL', 'ZEEL')
# summary statistics of variable of interest
nifty = nifty %>%
  mutate(Symbol = ifelse(Symbol %in% old_sym,
                         new_sym[match(Symbol, old_sym)], Symbol),
         Trades = ifelse(is.na(Trades), 0, Trades),
         Return = Close - `Prev Close`) %>%
  select(Date, Symbol, Return, VWAP, Volume, Trades)
summary(nifty[, -2])
## Date Return VWAP
## Min. :2000-01-03 Min. :-19525.95 Min. : 9.21
## 1st Qu.:2006-05-23 1st Qu.: -5.80 1st Qu.: 272.43
## Median :2011-06-07 Median : 0.05 Median : 554.00
## Mean :2011-02-20 Mean : 0.25 Mean : 1224.17
## 3rd Qu.:2016-02-05 3rd Qu.: 6.40 3rd Qu.: 1213.47
## Max. :2020-09-30 Max. : 2107.70 Max. :32975.24
## Volume Trades
## Min. : 3 Min. : 0
## 1st Qu.: 209105 1st Qu.: 0
## Median : 973757 Median : 235
## Mean : 2745270 Mean : 28762
## 3rd Qu.: 2836929 3rd Qu.: 41776
## Max. :481058927 Max. :1643015
# plot trend of all stocks
fig.cap1 = "**Figure 1.** Daily trend of all stocks, 2000-2020."
nifty_ts = reshape2::melt(nifty[, -c(2)], id.vars = "Date")
ggplot(nifty_ts, aes(x = Date, y = value)) +
geom_line(color="lightblue") +
theme_bw() +
facet_wrap(~ variable, scales = "free_y", ncol = 1)
Figure 1. Daily trend of all stocks, 2000-2020.
From the trend of all stocks, we can see that the time series exhibit non-stationarity. There was also a substantial shock to the Indian stock market after the outbreak of the Coronavirus.
The main packages used in this analysis are:
- pandas: for data cleaning and data frame mutations
- numpy: for numeric data processing
- datetime: for date-format data processing
- sklearn: for imputing missing data
- statsmodels and pmdarima: for the main ARIMA models
- matplotlib: for making plots

Firstly, we import the data.
import sys
sys.path.append('/usr/local/lib/python3.8/site-packages')
# Just to make sure the packages are loaded.
import pandas as pd
import numpy as np
df = pd.read_csv('./NIFTY50_all.csv')
The variable Trades has about 50% missing values, so we delete it. We also delete redundant variables and impute %Deliverable via simple imputation, since only about 7% of its values are missing. We also need to merge the different symbols used for the same stock.
from datetime import datetime # Dealing with date format data
from sklearn.impute import SimpleImputer # Impute data
# Drop redundant variables and variables with too many missing values
df['Date'] = [datetime.strptime(x, '%Y-%m-%d') for x in df['Date']]
df1 = df.drop(['Trades', 'Deliverable Volume', 'Series'], axis=1)
ls1 = ['MUNDRAPORT', 'UTIBANK', 'BAJAUTOFIN', 'BHARTI', 'HEROHONDA',
'HINDALC0', 'HINDLEVER', 'INFOSYSTCH', 'JSWSTL', 'KOTAKMAH', 'TELCO',
'TISCO', 'UNIPHOS', 'SESAGOA', 'SSLT', 'ZEETELE']
ls2 = ['ADANIPORTS', 'AXISBANK', 'BAJFINANCE', 'BHARTIARTL', 'HEROMOTOCO',
'HINDALCO', 'HINDUNILVR', 'INFY', 'JSWSTEEL', 'KOTAKBANK', 'TATAMOTORS',
'TATASTEEL', 'UPL', 'VEDL', 'VEDL', 'ZEEL']
df1['Symbol'] = df1['Symbol'].replace(ls1, ls2)
df1['Symbol'] = pd.Categorical(df1['Symbol'])
# Impute missing values
df2 = pd.get_dummies(data=df1, drop_first=True)
df2['Date']=df2['Date'].map(datetime.toordinal)
imp = SimpleImputer()
p = imp.fit_transform(df2)
df1['%Deliverble'] = p[:, 10]
We can take a look at the data. Take the stock “ADANIPORTS” as an example.
import matplotlib.pyplot as plt # Plotting package in python
names = df1['Symbol'].cat.categories
example = df1[df1['Symbol'] == names[0]]
fig, ax = plt.subplots(3, 1, figsize=(8, 8))
ax[0].plot(example['Date'], example['VWAP'])
ax[0].set_xticks([])
ax[0].set_xlabel('Days')
ax[0].set_ylabel('Volume weighted average price')
ax[1].plot(example['Date'], example['Volume'])
ax[1].set_xticks([])
ax[1].set_xlabel('Days')
ax[1].set_ylabel('Volume')
ax[2].plot(example['Date'], example['Turnover'])
ax[2].set_xticks([])
ax[2].set_xlabel('Days')
ax[2].set_ylabel('Turnover')
ax[0].set_title('Time series plots of stock %s' % names[0])
We will use the time series VWAP for the analysis below.
The differencing parameter \(d\) of the model can be determined by running Augmented Dickey-Fuller (ADF) tests, which indicate whether a time series is stationary. See the reference for more details about ADF tests. The Python packages statsmodels.tsa and pmdarima.arima are very helpful here.
from statsmodels.tsa.arima_model import ARIMA
from pmdarima.arima.utils import ndiffs # ARIMA model packages
# This function chooses the smallest d for the series to be stationary
names = df1['Symbol'].cat.categories
ls0 = []
for i in names:
    subdf = df1[df1['Symbol'] == i]
    # Select the rows for stock i
    ls0.append(ndiffs(subdf['VWAP'], test='adf'))
ls0 # Most values are 1
## [0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
max(ls0) # We don't need 2-order differencing
## 1
In order to ensure all time series are stationary, we choose \(d=1\) for all stocks.
The AR parameter \(p\) of the model can be determined by looking at Partial Autocorrelation (PACF) plots. These plots show the correlation between the series and each lag after removing the effects of shorter lags. We use the first 4 stocks alphabetically as a sample. Notice the series need to be differenced first (\(d=1\)).
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
fig, ax = plt.subplots(1, 4, figsize=(15, 5))
# Take the first 4 stocks as a sample
for i in range(4):
    subdf = df1[df1['Symbol'] == names[i]]
    # Select the rows for stock number i
    plot_pacf(subdf['VWAP'].diff().dropna(), ax=ax[i])
    ax[i].set_title('PACF plot for stock %s' % names[i])
Notice lag 1 is at least borderline significant in the plots for all stocks, but lag 2 is not significant. Therefore, we can choose \(p=1\) for all stocks.
The MA parameter \(q\) of the model can be determined by looking at Autocorrelation (ACF) plots. We use the next 4 stocks alphabetically as a sample. Similarly, the series need to be differenced (\(d=1\)).
fig, ax = plt.subplots(1, 4, figsize=(15, 5))
for i in range(4):
    subdf = df1[df1['Symbol'] == names[i+4]]
    plot_acf(subdf['VWAP'].diff().dropna(), ax=ax[i])
    ax[i].set_title('ACF plot for stock %s' % names[i+4])
Notice lag 1 is significant for most of the stocks but lag 2 is not. Therefore we can choose \(q=1\) for the MA term.
According to the process above, we choose the \(ARIMA(1, 1, 1)\) for all stocks. After fitting the models, we can view some of the summaries. Take the first 5 stocks alphabetically as an example.
mlist = [] # Models
flist = [] # Model fits
for i in names:
    subdf = df1[df1['Symbol'] == i]
    m = ARIMA(list(subdf['VWAP']), order=(1, 1, 1))
    mlist.append(m)
    flist.append(m.fit(disp=0))
for i in range(5):
    print(flist[i].summary())
## ARIMA Model Results
## ==============================================================================
## Dep. Variable: D.y No. Observations: 3178
## Model: ARIMA(1, 1, 1) Log Likelihood -13340.537
## Method: css-mle S.D. of innovations 16.100
## Date: Tue, 17 Nov 2020 AIC 26689.074
## Time: 21:08:29 BIC 26713.330
## Sample: 1 HQIC 26697.773
##
## ==============================================================================
## coef std err z P>|z| [0.025 0.975]
## ------------------------------------------------------------------------------
## const -0.2048 0.314 -0.652 0.514 -0.820 0.411
## ar.L1.D.y 0.1049 0.151 0.694 0.488 -0.192 0.401
## ma.L1.D.y -0.0156 0.152 -0.103 0.918 -0.313 0.282
## Roots
## =============================================================================
## Real Imaginary Modulus Frequency
## -----------------------------------------------------------------------------
## AR.1 9.5325 +0.0000j 9.5325 0.0000
## MA.1 64.1530 +0.0000j 64.1530 0.0000
## -----------------------------------------------------------------------------
## ARIMA Model Results
## ==============================================================================
## Dep. Variable: D.y No. Observations: 5162
## Model: ARIMA(1, 1, 1) Log Likelihood -29071.318
## Method: css-mle S.D. of innovations 67.549
## Date: Tue, 17 Nov 2020 AIC 58150.637
## Time: 21:08:29 BIC 58176.833
## Sample: 1 HQIC 58159.803
##
## ==============================================================================
## coef std err z P>|z| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 0.3108 0.960 0.324 0.746 -1.570 2.192
## ar.L1.D.y -0.1743 0.368 -0.473 0.636 -0.896 0.547
## ma.L1.D.y 0.1985 0.366 0.542 0.588 -0.519 0.916
## Roots
## =============================================================================
## Real Imaginary Modulus Frequency
## -----------------------------------------------------------------------------
## AR.1 -5.7387 +0.0000j 5.7387 0.5000
## MA.1 -5.0380 +0.0000j 5.0380 0.5000
## -----------------------------------------------------------------------------
## ARIMA Model Results
## ==============================================================================
## Dep. Variable: D.y No. Observations: 5162
## Model: ARIMA(1, 1, 1) Log Likelihood -24320.029
## Method: css-mle S.D. of innovations 26.908
## Date: Tue, 17 Nov 2020 AIC 48648.059
## Time: 21:08:29 BIC 48674.255
## Sample: 1 HQIC 48657.225
##
## ==============================================================================
## coef std err z P>|z| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 0.0766 0.399 0.192 0.848 -0.706 0.859
## ar.L1.D.y -0.0007 0.237 -0.003 0.998 -0.465 0.464
## ma.L1.D.y 0.0665 0.237 0.281 0.779 -0.397 0.530
## Roots
## =============================================================================
## Real Imaginary Modulus Frequency
## -----------------------------------------------------------------------------
## AR.1 -1485.7047 +0.0000j 1485.7047 0.5000
## MA.1 -15.0293 +0.0000j 15.0293 0.5000
## -----------------------------------------------------------------------------
## ARIMA Model Results
## ==============================================================================
## Dep. Variable: D.y No. Observations: 3058
## Model: ARIMA(1, 1, 1) Log Likelihood -15798.297
## Method: css-mle S.D. of innovations 42.406
## Date: Tue, 17 Nov 2020 AIC 31604.593
## Time: 21:08:29 BIC 31628.695
## Sample: 1 HQIC 31613.254
##
## ==============================================================================
## coef std err z P>|z| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 0.7437 0.831 0.895 0.371 -0.885 2.372
## ar.L1.D.y -0.2225 0.127 -1.758 0.079 -0.470 0.026
## ma.L1.D.y 0.3245 0.122 2.654 0.008 0.085 0.564
## Roots
## =============================================================================
## Real Imaginary Modulus Frequency
## -----------------------------------------------------------------------------
## AR.1 -4.4950 +0.0000j 4.4950 0.5000
## MA.1 -3.0812 +0.0000j 3.0812 0.5000
## -----------------------------------------------------------------------------
## ARIMA Model Results
## ==============================================================================
## Dep. Variable: D.y No. Observations: 3057
## Model: ARIMA(1, 1, 1) Log Likelihood -17217.768
## Method: css-mle S.D. of innovations 67.579
## Date: Tue, 17 Nov 2020 AIC 34443.535
## Time: 21:08:29 BIC 34467.636
## Sample: 1 HQIC 34452.196
##
## ==============================================================================
## coef std err z P>|z| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 1.7395 1.497 1.162 0.245 -1.195 4.674
## ar.L1.D.y -0.1671 0.071 -2.350 0.019 -0.307 -0.028
## ma.L1.D.y 0.4298 0.066 6.537 0.000 0.301 0.559
## Roots
## =============================================================================
## Real Imaginary Modulus Frequency
## -----------------------------------------------------------------------------
## AR.1 -5.9832 +0.0000j 5.9832 0.5000
## MA.1 -2.3266 +0.0000j 2.3266 0.5000
## -----------------------------------------------------------------------------
Notice that for the first 3 stocks, the model fit is not good. We can consider removing the MA term since it’s non-significant for some stocks, i.e. choosing the \(ARIMA(1, 1, 0)\) model.
mlist0 = [] # Models
flist0 = [] # Model fits
for i in names:
    subdf = df1[df1['Symbol'] == i]
    m = ARIMA(list(subdf['VWAP']), order=(1, 1, 0))
    mlist0.append(m)
    flist0.append(m.fit(disp=0))
We use AIC as the model selection criterion. The AIC decreases for the first three stocks and increases for the 4th and 5th, indicating that different stocks need different models.
includema = [] # Whether MA term should be included
for i in range(50):
    includema.append(flist0[i].aic > flist[i].aic)
pd.value_counts(includema)
## True 28
## False 22
## dtype: int64
Firstly, we can use the in-sample lagged values to predict the time series.
fig, ax = plt.subplots(10, 5, figsize=(15, 20))
for i in range(50):
    if includema[i]:  # ARIMA(1,1,1)
        flist[i].plot_predict(dynamic=False, ax=ax[i // 5, i % 5])
    else:  # ARIMA(1,1,0)
        flist0[i].plot_predict(dynamic=False, ax=ax[i // 5, i % 5])
    ax[i // 5, i % 5].set_title(names[i])
fig.tight_layout()
fig
We can also forecast future VWAPs using the chosen models. For example, we can forecast the average prices in the next 200 trading days after the time series.
fig, ax = plt.subplots(10, 5, figsize=(15, 20))
for i in range(50):
    if includema[i]:
        forecast, b, ci = flist[i].forecast(200, alpha=0.05)
    else:
        forecast, b, ci = flist0[i].forecast(200, alpha=0.05)
    subdf = df1[df1['Symbol'] == names[i]]
    ax[i // 5, i % 5].plot(list(subdf['VWAP']))
    idx = range(len(subdf['VWAP']), 200 + len(subdf['VWAP']))
    ax[i // 5, i % 5].plot(idx, forecast)
    ax[i // 5, i % 5].fill_between(idx, ci[:, 0], ci[:, 1], alpha=0.15)
    ax[i // 5, i % 5].set_title(names[i])
    ax[i // 5, i % 5].set_xticks([])
fig.tight_layout()
fig
Notice the confidence intervals are very wide, indicating it’s not easy to forecast stock prices.
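This widening is expected. With \(d = 1\), the level forecast behaves like a random-walk forecast, whose h-step-ahead variance grows roughly linearly in h, so the interval half-width grows like \(\sqrt{h}\). A numeric sketch (our own back-of-the-envelope approximation, ignoring the ARMA terms) using the innovation standard deviation 16.1 reported in the first model summary above:

```python
import numpy as np

sigma = 16.1  # S.D. of innovations from the first ARIMA summary above
# Approximate 95% interval half-width for a random-walk forecast at horizon h
halfwidth = {h: 1.96 * sigma * np.sqrt(h) for h in (1, 50, 200)}
for h, hw in halfwidth.items():
    print(h, round(hw, 1))
```

At h = 200 the approximate 95% band is already hundreds of rupees wide, consistent with the fan shapes in the forecast plots.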
Now that we chose different models for different stocks, we can further improve the models by choosing the most proper model for each stock.
We can use auto_arima to choose models. It compares different models and chooses the best one. Again, we use ADF test to determine \(d\), and AIC to determine \(p,q\). Take “ADANIPORTS”, “ASIANPAINT” and “BPCL” as examples.
import pmdarima as paim
subdf = df1[df1['Symbol'] == names[0]]
m1 = paim.auto_arima(subdf['VWAP'], start_p=1, start_q=1, test='adf',
max_p=3, max_q=3)
m1.summary() # The chosen model was ARIMA(1, 0, 1), which is a good fit.
## <class 'statsmodels.iolib.summary.Summary'>
## """
## SARIMAX Results
## ==============================================================================
## Dep. Variable: y No. Observations: 3179
## Model: SARIMAX(1, 0, 1) Log Likelihood -13346.144
## Date: Tue, 17 Nov 2020 AIC 26700.287
## Time: 21:08:50 BIC 26724.545
## Sample: 0 HQIC 26708.987
## - 3179
## Covariance Type: opg
## ==============================================================================
## coef std err z P>|z| [0.025 0.975]
## ------------------------------------------------------------------------------
## intercept 1.1366 1.144 0.993 0.321 -1.106 3.379
## ar.L1 0.9971 0.002 572.414 0.000 0.994 1.001
## ma.L1 0.0892 0.005 16.587 0.000 0.079 0.100
## sigma2 259.0152 0.418 619.397 0.000 258.196 259.835
## ===================================================================================
## Ljung-Box (Q): 52.63 Jarque-Bera (JB): 120511061.15
## Prob(Q): 0.09 Prob(JB): 0.00
## Heteroskedasticity (H): 0.06 Skew: -22.97
## Prob(H) (two-sided): 0.00 Kurtosis: 955.73
## ===================================================================================
##
## Warnings:
## [1] Covariance matrix calculated using the outer product of gradients (complex-step).
## """
ls0[0] # Indeed, differencing is not needed for the stock 'ADANIPORTS'.
## 0
subdf = df1[df1['Symbol'] == names[1]]
m2 = paim.auto_arima(subdf['VWAP'], start_p=1, start_q=1, test='adf',
max_p=3, max_q=3)
m2.summary() # The chosen model was ARIMA(0, 1, 0), which is a good fit.
# For 'ASIANPAINT', the 1-order difference series is close to a constant.
## <class 'statsmodels.iolib.summary.Summary'>
## """
## SARIMAX Results
## ==============================================================================
## Dep. Variable: y No. Observations: 5163
## Model: SARIMAX(0, 1, 0) Log Likelihood -29072.936
## Date: Tue, 17 Nov 2020 AIC 58147.873
## Time: 21:08:51 BIC 58154.422
## Sample: 0 HQIC 58150.164
## - 5163
## Covariance Type: opg
## ==============================================================================
## coef std err z P>|z| [0.025 0.975]
## ------------------------------------------------------------------------------
## sigma2 4564.8605 1.982 2303.040 0.000 4560.976 4568.745
## ===================================================================================
## Ljung-Box (Q): 42.98 Jarque-Bera (JB): 3632346684.27
## Prob(Q): 0.34 Prob(JB): 0.00
## Heteroskedasticity (H): 4.65 Skew: -60.55
## Prob(H) (two-sided): 0.00 Kurtosis: 4110.73
## ===================================================================================
##
## Warnings:
## [1] Covariance matrix calculated using the outer product of gradients (complex-step).
## """
subdf = df1[df1['Symbol'] == names[7]]
m3 = paim.auto_arima(subdf['VWAP'], start_p=1, start_q=1, test='adf',
max_p=3, max_q=3, error_action='ignore')
m3.summary() # The chosen model was ARIMA(1, 1, 2), which is a good fit.
# For 'BPCL' second order MA term is needed.
## <class 'statsmodels.iolib.summary.Summary'>
## """
## SARIMAX Results
## ==============================================================================
## Dep. Variable: y No. Observations: 5163
## Model: SARIMAX(1, 1, 2) Log Likelihood -21047.069
## Date: Tue, 17 Nov 2020 AIC 42102.139
## Time: 21:09:09 BIC 42128.335
## Sample: 0 HQIC 42111.305
## - 5163
## Covariance Type: opg
## ==============================================================================
## coef std err z P>|z| [0.025 0.975]
## ------------------------------------------------------------------------------
## ar.L1 0.9703 0.019 51.948 0.000 0.934 1.007
## ma.L1 -0.9021 0.020 -45.739 0.000 -0.941 -0.863
## ma.L2 -0.0779 0.008 -10.104 0.000 -0.093 -0.063
## sigma2 203.7130 0.511 398.595 0.000 202.711 204.715
## ===================================================================================
## Ljung-Box (Q): 31.14 Jarque-Bera (JB): 97447842.12
## Prob(Q): 0.84 Prob(JB): 0.00
## Heteroskedasticity (H): 5.01 Skew: -18.28
## Prob(H) (two-sided): 0.00 Kurtosis: 675.11
## ===================================================================================
##
## Warnings:
## [1] Covariance matrix calculated using the outer product of gradients (complex-step).
## """
We can improve each of the models individually for slightly better forecast performance.
We will focus on the variables Symbol, VWAP, Volume, Trades, and a newly created variable Return, which is the difference between Close and Prev Close. First we choose one stock and analyze the autocorrelation of the above four variables. The trend plot above shows large values of Volume, so we take its logarithm to reduce the scale.
library(forecast)
candidate = "ADANIPORTS"
nifty_cand = nifty %>% filter(Symbol == candidate) %>%
mutate(Volume = log(Volume))
vars_list = names(nifty_cand)[-c(1, 2)]
fig.cap2 = "**Figure 3.1.** ACF plots of stock: ADANIPORTS."
par(mfrow=c(2,2))
for (var in vars_list) {
  df_ts = ts(nifty_cand[[var]], frequency = 1, start = c(2000, 01, 03))
  # acf
  acf_tou <- acf(df_ts, lag.max = 30, plot = FALSE)
  plot(acf_tou, xlab = "Lag (in Year)", main = var)
}
Figure 3.1. ACF plots of stock: ADANIPORTS.
Notice that all the variables show high autocorrelation except for Return. This is because Return is computed as the first difference of closing prices, and first differencing acts as a linear filter that eliminates a trend. Since we will apply an ARIMA model, which can handle non-stationary time series, we can leave the detrending to the model. For illustration purposes, we take the first difference of the other three variables and compare the autocorrelation plots to the previous ones.
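The filtering effect of differencing can be seen in a two-line sketch (toy data of our own): a deterministic linear trend differences to a constant.

```python
import numpy as np

trend = np.arange(10, dtype=float)  # linear trend 0, 1, ..., 9
diffed = np.diff(trend)             # first difference removes the trend
print(diffed)  # nine ones
```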
fig.cap3 = "**Figure 3.2.** ACF plots of first differences of the time series."
acf_diff_plot(df = nifty, sym = candidate)
Figure 3.2. ACF plots of first differences of the time series.
We can choose the ARIMA model by AIC. Let’s tabulate AIC values for a range of choices of p and q, taking d = 0 for Return and d = 1 for the other three variables. Below is the AIC table from fitting ARIMA models to the Return series of the stock “ADANIPORTS”. We hold out the last 30 observations as test data.
| | MA0 | MA1 | MA2 | MA3 | MA4 |
|---|---|---|---|---|---|
| AR0 | 336526.9 | 336472.5 | 336467.4 | 336418.2 | 336397.1 |
| AR1 | 336471.1 | 336472.2 | 336442.9 | 336160.5 | 336146.7 |
| AR2 | 336470.3 | 336449.3 | 336458.8 | 335870.8 | 335872.1 |
| AR3 | 336415.3 | 336121.2 | 335870.7 | 335872.7 | 335784.7 |
| AR4 | 336400.3 | 336095.4 | 336125.2 | 335783.9 | 335522.1 |
The AIC table suggests that ARIMA(4, 0, 4) is the best model for the return of “ADANIPORTS”. It may also hint that further increasing p and q would keep lowering the AIC for a better fit. However, models with higher p and q are more complex, which can lead to problems such as overfitting and numerical instability. We usually prefer a simpler model, which is also easier to interpret.
Even though it is informative to see how the AIC changes with p and q, for a big dataset like this it is very inefficient to iterate over a grid of p and q values. auto.arima() in the forecast package generates the results much faster. We can verify that auto.arima() also selects ARIMA(4, 0, 4) as the best-fitting model with p and q at most 4.
ts_arima <- auto.arima(head(nifty_cand_ts, -30), max.p = 4,
max.q = 4, max.d = 3)
print(ts_arima)
## Series: head(nifty_cand_ts, -30)
## ARIMA(4,0,4) with zero mean
##
## Coefficients:
## ar1 ar2 ar3 ar4 ma1 ma2 ma3 ma4
## 0.0755 1.0253 0.0893 -0.7314 -0.0316 -1.0385 -0.2077 0.7288
## s.e. 0.0183 0.0157 0.0150 0.0128 0.0195 0.0160 0.0145 0.0146
##
## sigma^2 estimated as 5155: log likelihood=-167751.7
## AIC=335521.5 AICc=335521.5 BIC=335596.1
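As a sanity check on the reported AIC (our own arithmetic, not part of the output), AIC = 2k − 2 log L with k = 9 estimated parameters here (eight ARMA coefficients plus \(\sigma^2\)):

```python
loglik = -167751.7  # log likelihood reported by auto.arima above
k = 9               # 8 ARMA coefficients + sigma^2
aic = 2 * k - 2 * loglik
print(round(aic, 1))  # 335521.4, matching the reported 335521.5 up to rounding
```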
Lastly, we forecast the next 30 observations and compare the results with our test set.
ts_forecasts <- forecast(ts_arima, h = 30)
acc <- accuracy(ts_forecasts, head(tail(nifty_cand_ts, 30), 7))
print(round(acc, 4))
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -0.4800 71.7890 15.4322 NaN Inf 0.7424 -0.0081
## Test set 3.0744 5.5276 4.5046 -49.0596 206.901 0.2167 -0.3585
## Theil's U
## Training set NA
## Test set 0.9729
The RMSE and MAE for the test set are 5.5276 and 4.5046, respectively.
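For reference, RMSE and MAE compare forecasts to held-out values as follows (a generic sketch with toy numbers, not the actual test data):

```python
import numpy as np

actual = np.array([3.0, -1.0, 2.0, 0.5])      # hypothetical test returns
forecasted = np.array([2.5, -0.5, 2.5, 0.0])  # hypothetical forecasts

err = actual - forecasted
rmse = np.sqrt(np.mean(err ** 2))  # root mean squared error: penalizes large errors
mae = np.mean(np.abs(err))         # mean absolute error
print(rmse, mae)  # 0.5 0.5
```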
fig.cap4 = "**Figure 3.3.** Forecasts from ARIMA(4, 0, 4) with zero mean."
autoplot(ts_forecasts, main = "") + xlab("Day") +
ylab("Return") +
theme_bw()
Figure 3.3. Forecasts from ARIMA(4, 0, 4) with zero mean.
Firstly, we import the data and do data cleaning. We drop the variables Series, Trades, and Deliverable Volume. Also, we transform the Date variable from string type to date type so that the whole data set can be treated as a time series data set.
import delimited NIFTY50_all.csv, clear
* Data Cleaning
gen date2 = date(date, "YMD")
format date2 %tdCCYY-nn-dd
drop date series
drop trades deliverablevolume
rename date2 date
label variable date "Date"
* Replace Symbol Names
replace symbol = "ADANIPORTS" if symbol == "MUNDRAPORT"
replace symbol = "AXISBANK" if symbol == "UTIBANK"
replace symbol = "BAJFINANCE" if symbol == "BAJAUTOFIN"
replace symbol = "BHARTIARTL" if symbol == "BHARTI"
replace symbol = "HEROMOTOCO" if symbol == "HEROHONDA"
replace symbol = "HINDALCO" if symbol == "HINDALC0"
replace symbol = "HINDUNILVR" if symbol == "HINDLEVER"
replace symbol = "INFY" if symbol == "INFOSYSTCH"
replace symbol = "JSWSTEEL" if symbol == "JSWSTL"
replace symbol = "KOTAKBANK" if symbol == "KOTAKMAH"
replace symbol = "TATAMOTORS" if symbol == "TELCO"
replace symbol = "TATASTEEL" if symbol == "TISCO"
replace symbol = "UPL" if symbol == "UNIPHOS"
replace symbol = "VEDL" if symbol == "SESAGOA"
replace symbol = "VEDL" if symbol == "SSLT"
replace symbol = "ZEEL" if symbol == "ZEETELE"
* Save the cleaned data
save NIFTY_clean, replace
Then we visualize the data, taking the stock “ADANIPORTS” as an example.
use NIFTY_clean, clear
keep if symbol == "ADANIPORTS"
graph twoway line vwap date, color("blue") xtitle("Days") ///
    ytitle("Volume weighted average price")
graph export vwap_data.png, replace
graph twoway line volume date, color("blue") xtitle("Days") ytitle("Volume")
graph export volume_data.png, replace
graph twoway line turnover date, color("blue") xtitle("Days") ///
    ytitle("Turnover")
graph export turnover_data.png, replace
We will use the time series VWAP for the analysis below.
For all stocks, we do Augmented Dickey-Fuller tests to determine whether the time series are stationary or not.
use NIFTY_clean, clear
levelsof symbol, local(sbls)
foreach sym of local sbls {
    use NIFTY_clean, clear
    keep if symbol == "`sym'"
    tsset date
    dfuller d1.vwap
}
We run the test on the first-order differenced series. All stocks report p-values near zero, hence we decide to use \(d=1\) for all stocks.
Then, in order to find the AR parameter \(p\) of the model, we generate partial autocorrelation (PACF) plots together with autocorrelation (ACF) plots.
Note: we will only plot the first 8 stocks as an example.
use NIFTY_clean, clear
levelsof symbol, local(sbls)
local sbls_f8 "ADANIPORTS ASIANPAINT AXISBANK BAJAJ-AUTO BAJAJFINSV BAJFINANCE BHARTIARTL BPCL"
foreach sym of local sbls_f8 {
    use NIFTY_clean, clear
    keep if symbol == "`sym'"
    tsset date
    ac D.vwap
    graph export acf_`sym'.png
    pac D.vwap
    graph export pacf_`sym'.png
}
The PACF plots for the first 4 stocks are the following:
And the ACF plots for the next 4 stocks are the following:
We reach a similar conclusion: lag 1 is clearly significant while lag 2 is not, hence we can choose \(p=1\) for the AR term and \(q=1\) for the MA term for all stocks.
According to the process above, we choose \(ARIMA(1, 1, 1)\) for all stocks. However, diagnostics show that \(ARIMA(1, 1, 0)\) sometimes performs better for certain stocks. Hence, we fit whichever model is better for each stock and then plot the predicted values against the original values.
Note: we will only plot the first 5 stocks as an example.
use NIFTY_clean, clear
local sbls_f5 = "ADANIPORTS ASIANPAINT AXISBANK BAJAJ-AUTO BAJAJFINSV"
foreach sym of local sbls_f5 {
    use NIFTY_clean, clear
    keep if symbol == "`sym'"
    tsset date
    arima vwap, arima(1,1,1)
    estat ic
    mat l_aim = r(S)
    scalar aic_aim = l_aim[1,5]
    arima vwap, arima(1,1,0)
    estat ic
    mat l_ai = r(S)
    scalar aic_ai = l_ai[1,5]
    if aic_aim > aic_ai {
        arima vwap, arima(1,1,0)
        predict vwap_pd
        gen vwap_p = vwap_pd + vwap
        graph twoway line vwap date, lwidth("vthin") color("blue") ///
            || line vwap_p date, lwidth("vthin") color("red") lpattern("dash")
        graph export fitted_`sym'.png
    }
    else {
        arima vwap, arima(1,1,1)
        predict vwap_pd
        gen vwap_p = vwap_pd + vwap
        graph twoway line vwap date, lwidth("vthin") color("blue") ///
            || line vwap_p date, lwidth("vthin") color("red") lpattern("dash")
        graph export fitted_`sym'.png
    }
}
The sample fitted graphs are:
Now that we chose different models for different stocks, we can further improve the models by choosing the most proper model for each stock.
However, Stata does not have a function similar to auto_arima for choosing models automatically. Hence, we may resort to the other two languages (Python, R); heavy and tedious computation would be expected in Stata here.